etcdserver: add watchdog to detect stalled writes #15440
Conversation
(force-pushed from b5782da to 39abdbe)
This is the draft PR per our discussion in the doc. @mitake @ptabor @serathius @chaochn47 @fuweid Please let me know if you have any immediate comments or concerns before I continue to add tests.
(force-pushed from 39abdbe to 8a50ca5)
(force-pushed from 8a50ca5 to 69e29bf)
(force-pushed from b8e2988 to 989cce8)
server/watchdog/watchdog.go (outdated):
```go
v.inactiveElapsed++
if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
	elapsedTime := time.Duration(v.inactiveElapsed*tickMs) * time.Millisecond
	wd.lg.Warn("Slow activity detected", zap.String("activity", v.name), zap.Duration("duration", elapsedTime))
```
nit: please add a gauge metric for the slow-activities count, with the activity name as a label. I think this will be useful for folks who rely on metrics to actively monitor an etcd cluster.
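For illustration, such a gauge might be declared like this (a sketch using the prometheus client library; the metric and label names are assumptions, not etcd's actual metrics):

```go
package watchdog

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge: number of currently-slow activities, labeled by
// activity name. Metric and label names are illustrative only.
var slowActivities = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "watchdog_slow_activities",
		Help:      "Number of activities currently detected as slow by the watchdog.",
	},
	[]string{"activity"},
)

func init() {
	prometheus.MustRegister(slowActivities)
}

// The tick loop would then do something like:
//   slowActivities.WithLabelValues(v.name).Inc() // when an activity turns slow
//   slowActivities.WithLabelValues(v.name).Dec() // when it completes
```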
I would like to clarify what the proposed solution is and how it differs from the design proposed by @ptabor and me. Can you please clean up the design document before people jump into reviewing the code?
@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
```go
spath := filepath.Join(s.dir, fname)

fsyncStart := time.Now()
cancel := watchdog.Register("save v2 snapshot")
```
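For context, the registration presumably brackets the snapshot write, roughly like this (a paraphrase extrapolated from the two added lines; the write call and metric name are assumptions based on `Snapshotter.save`, not the verbatim PR code):

```go
fsyncStart := time.Now()
cancel := watchdog.Register("save v2 snapshot") // start tracking the activity
err = pioutil.WriteAndSyncFile(spath, d, 0666)  // write + fsync the snapshot bytes
cancel()                                        // done; the watchdog stops tracking it
snapFsyncSec.Observe(time.Since(fsyncStart).Seconds()) // assumed metric name
```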
I think we need to decide on the naming. Options:
a)
```go
endActivity := wd.StartActivity("saving v2 snapshot")
endActivity()
```
b)
```go
markAsDone := wd.Start("saving v2 snapshot")
markAsDone()
```
c)
```go
markAsDone := wd.Notify("saving v2 snapshot")
markAsDone()
```
d)
```go
endScope := wd.StartScope("saving v2 snapshot")
endScope()
```
We might also have syntactic sugar: `watchdog.executeInScope("saving v2 snapshot", func() { ... })`.
I don't like `cancel`, as it suggests an interruption of an existing process, and it's not.
I hoped we would settle this (API/naming) in a document rather than keep updating the PR, which is relatively expensive.
I like the suggestion of syntactic sugar. I imagine it's similar to bbolt's `View` and `Update` methods.
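For reference, bbolt's closure style looks like this (real go.etcd.io/bbolt API; the helper and bucket name are illustrative). The proposed `executeInScope` sugar would follow the same shape:

```go
package main

import bolt "go.etcd.io/bbolt"

// put wraps the work in Update: the closure runs inside a managed read-write
// transaction that bbolt commits or rolls back on return, just as
// executeInScope would end the activity when the closure returns.
func put(db *bolt.DB, key, value []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("demo"))
		if err != nil {
			return err
		}
		return b.Put(key, value)
	})
}
```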
`StartXXX` seems not good to me; it doesn't start any activity, it essentially just registers the activity with the watchdog. So how about `Register` and `unregister`?
```go
unregister := wd.Register("saving v2 snapshot")
unregister()
```
Users can also call the syntactic sugar as you proposed below:
```go
wd.Execute("saving v2 snapshot", fn)
```
Please also see the updated doc https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.3oeryohw1c9o
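For concreteness, a minimal sketch of what this `Register`/`Execute` surface could look like (assumed shape and names; not the actual PR code):

```go
package watchdog

import (
	"sync"
	"time"
)

type Watchdog struct {
	mu         sync.Mutex
	activities map[string]time.Time // activity name -> registration time
}

func New() *Watchdog {
	return &Watchdog{activities: make(map[string]time.Time)}
}

// Register starts tracking an activity and returns a func that stops
// tracking it. The tick loop would flag activities registered too long ago.
func (wd *Watchdog) Register(name string) (unregister func()) {
	wd.mu.Lock()
	wd.activities[name] = time.Now()
	wd.mu.Unlock()
	return func() {
		wd.mu.Lock()
		delete(wd.activities, name)
		wd.mu.Unlock()
	}
}

// Execute is the syntactic sugar: the activity is tracked for exactly the
// duration of fn.
func (wd *Watchdog) Execute(name string, fn func()) {
	defer wd.Register(name)()
	fn()
}
```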
@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
```go
spath := filepath.Join(s.dir, fname)

fsyncStart := time.Now()
cancel := watchdog.Register("save v2 snapshot")
```
I wonder what you think about pre-registering the types of activities that are tracked in the watchdog:
```go
// at the package level:
var SAVING_V2_SNAPSHOTS = watchdog.RegisterActivity("saving v2 snapshots")
```
```go
endActivity := wd.StartActivity(SAVING_V2_SNAPSHOTS)
endActivity()
```
The main benefit is that it restricts the API usage to a small set of strings rather than patterns like:
```go
wd.StartActivity(fmt.Sprintf("Writing log entry: %d", entryId))
```
so the API can, e.g., be used for monitoring how long classes of activities take within the scope of a watchdog, i.e. the watchdog sees that 17% of wall-time was spent in the writing-logs routine.
It's a stretch and over-design, but in theory such registration also allows building a hierarchy of activities/scopes.
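One way such pre-registration could support that per-class accounting (a sketch with assumed names; locking omitted for brevity):

```go
package watchdog

import "time"

// Activity is a pre-registered class of work; callers pass the handle
// instead of ad-hoc formatted strings, so the watchdog can aggregate
// per class of activity.
type Activity struct {
	name        string
	activeSince time.Time     // zero value while inactive
	total       time.Duration // accumulated active wall-time
}

var registry []*Activity

func RegisterActivity(name string) *Activity {
	a := &Activity{name: name}
	registry = append(registry, a)
	return a
}

// StartActivity marks the activity active and returns a func that ends it,
// accumulating wall-time so the watchdog can report e.g. "17% of wall-time
// was spent in the writing-logs routine".
func StartActivity(a *Activity) (endActivity func()) {
	a.activeSince = time.Now()
	return func() {
		a.total += time.Since(a.activeSince)
		a.activeSince = time.Time{}
	}
}
```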
server/watchdog/watchdog.go (outdated):
```go
wd.mu.Lock()
for _, v := range wd.activities {
	v.inactiveElapsed++
	if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
```
The fundamental difference between this design and the watchdog pattern in general is that you propose to monitor specific registered activities. I assumed the goal was to monitor the watchdog as a whole: the watchdog is unhealthy when, for the whole 'timeout', it got no 'Start' nor 'Stop/Done' notifications. Any such notification resets the watchdog; a lack of new activities is especially an alert.
We register specific activities only to:
a) have actionable error messages:
- the watchdog is stuck with the following active activities...
- the watchdog is stuck and the last started/done activity is X
b) [stretch] we might piggyback on it and collect metrics on what percentage of time specific activities are in the active state (see the sketch below)
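To make that concrete, the whole-watchdog reading could be as small as this (a sketch under the assumptions above, not code from this PR):

```go
package watchdog

import (
	"sync"
	"time"
)

type globalWatchdog struct {
	mu       sync.Mutex
	lastBeat time.Time // time of the most recent Start/Done notification
}

// notify resets the whole watchdog; both Start and Stop/Done would call it,
// so any activity anywhere keeps the watchdog healthy.
func (wd *globalWatchdog) notify() {
	wd.mu.Lock()
	wd.lastBeat = time.Now()
	wd.mu.Unlock()
}

// healthy reports whether any notification arrived within the timeout;
// total silence for the whole timeout is the alert condition.
func (wd *globalWatchdog) healthy(timeout time.Duration) bool {
	wd.mu.Lock()
	defer wd.mu.Unlock()
	return time.Since(wd.lastBeat) <= timeout
}
```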
Thanks all for the feedback, which basically makes sense to me. The overall idea essentially comes from one of @ptabor's comments in the doc; I will update and clean up the doc later. I will also keep this PR (just a PoC) in sync with the doc so as to avoid any misunderstanding. One thing I'd like to clarify: previously I thought it should be an etcd-raft-loop watchdog, notified each time a Ready struct is received, and unhealthy if it doesn't receive any new notification within the given max duration. But on second thought that seems not good, so eventually I changed it to monitor only registered activities (e.g. syncing the WAL log, committing the boltDB transaction, etc.). Any concern?
(force-pushed from 865764b to 9b6a4a3)
Use watchdog to monitor all storage read/write operations
Signed-off-by: Benjamin Wang <[email protected]>
(force-pushed from 9b6a4a3 to 93f6dcc)
Please also see the updated doc https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.3oeryohw1c9o
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@ahrtr: The following test failed, say `/retest` to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Hello developers! I am wondering what the status of this issue is right now. Will it be merged into the main branch?
We decided to go with a different design: etcd will detect stalled writes by checking raft loop execution and expose a
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.